Using SQL in Python + Visualizations

1.Introduction

The Chinook database is a sample database for the SQL server, Oracle, MySQL, Postgres etc. It is ideal for demos, practicing and familiarzing oneself with relational database management. The Chinook data model represents a digital media retailer. It includes tables for music artists, music tracks, invoices, and customers.

Below is the entity relationship diagram of the Chinook data model. image.png

2. Running SQLite in Jupyter notebook

  1. Import SQLite into the Jupyter notebook
  2. The connect() function is used create a database in the environment and maintains the connection while working with that database. While I am connected, other users will not be able to access the database at the same time, unless I close the connection using close()
  3. Each SQL query is passed through a string ("x", """x""")
  4. Running SQL with pandas provides plotting options through libraries such as seaborn matplotplib, plotly express etc.
In [1]:
# import SQLite and pandas 
import sqlite3
import pandas as pd 

# plotting library
import plotly.express as px

# connect to database
sql_connect = sqlite3.connect('chinook.db')

# cursor function to assist with executing queries
cursor = sql_connect.cursor()
C:\Users\ke117\Anaconda3\lib\site-packages\dask\config.py:168: YAMLLoadWarning:

calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

C:\Users\ke117\Anaconda3\lib\site-packages\distributed\config.py:20: YAMLLoadWarning:

calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

3. Data table overview

  • Saving queries into objects is recommended to avoid re-writing the whole query block again, should it be necessary to call it later.
  • The cursor function converts the executed query results into tuples and is stored as a local object.
  • The function fetchall() retrieves the results.
  • Pandas is required to actually run the saved query.
  • Standard pandas functions can still be applied to query results
In [2]:
# Save SQL query as a string, where sqlite_master is the table
query = "SELECT * FROM sqlite_master LIMIT 5;"

# Execute query using cursor object 
results = cursor.execute(query).fetchall()

# Run, output will be similar to a pandas dataframe
pd.read_sql_query(query, sql_connect)
Out[2]:
type name tbl_name rootpage sql
0 table albums albums 2 CREATE TABLE "albums"\r\n(\r\n [AlbumId] IN...
1 table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
2 table artists artists 4 CREATE TABLE "artists"\r\n(\r\n [ArtistId] ...
3 table customers customers 5 CREATE TABLE "customers"\r\n(\r\n [Customer...
4 table employees employees 8 CREATE TABLE "employees"\r\n(\r\n [Employee...

The 'column' tbl_name contains the list of available tables in the database

  • The database contains the following tables: albums, sqlite_sequence, artists, customers and employees
In [37]:
query = """
            SELECT * FROM artists
            LIMIT 5
        """
results = cursor.execute(query).fetchall()
pd.read_sql_query(query, sql_connect)
Out[37]:
ArtistId Name
0 1 AC/DC
1 2 Accept
2 3 Aerosmith
3 4 Alanis Morissette
4 5 Alice In Chains
In [4]:
query = """
            SELECT * FROM playlist_track
        """
results = cursor.execute(query).fetchall()
x = pd.read_sql_query(query, sql_connect)
x.head()
Out[4]:
PlaylistId TrackId
0 1 3402
1 1 3389
2 1 3390
3 1 3391
4 1 3392
In [63]:
# Apply value_counts() function to query results
x.Country.value_counts()
Out[63]:
USA               13
Canada             8
France             5
Brazil             5
Germany            4
United Kingdom     3
India              2
Portugal           2
Czech Republic     2
Netherlands        1
Poland             1
Belgium            1
Norway             1
Finland            1
Ireland            1
Hungary            1
Australia          1
Italy              1
Sweden             1
Chile              1
Argentina          1
Denmark            1
Austria            1
Spain              1
Name: Country, dtype: int64
In [48]:
query = """
            SELECT name, type FROM sqlite_master 
            WHERE type IN('table', 'view')
            """
results = cursor.execute(query).fetchall()
pd.read_sql_query(query, sql_connect)
Out[48]:
name type
0 albums table
1 sqlite_sequence table
2 artists table
3 customers table
4 employees table
5 genres table
6 invoices table
7 invoice_items table
8 media_types table
9 playlists table
10 playlist_track table
11 tracks table
12 sqlite_stat1 table
In [86]:
query = """
            SELECT * FROM genres
            LIMIT 5
            """
results = cursor.execute(query).fetchall()
pd.read_sql_query(query, sql_connect)
Out[86]:
GenreId Name
0 1 Rock
1 2 Jazz
2 3 Metal
3 4 Alternative & Punk
4 5 Rock And Roll
In [82]:
query = """
            SELECT * FROM tracks
            LIMIT 1
            """
results = cursor.execute(query).fetchall()
pd.read_sql_query(query, sql_connect)
Out[82]:
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
0 1 For Those About To Rock (We Salute You) 1 1 1 Angus Young, Malcolm Young, Brian Johnson 343719 11170334 0.99
In [176]:
query = """
            SELECT * FROM artists
            LIMIT 1
            """
results = cursor.execute(query).fetchall()
pd.read_sql_query(query, sql_connect)
Out[176]:
ArtistId Name
0 1 AC/DC
In [83]:
query = """
            SELECT * FROM invoice_items
            LIMIT 1
            """
results = cursor.execute(query).fetchall()
pd.read_sql_query(query, sql_connect)
Out[83]:
InvoiceLineId InvoiceId TrackId UnitPrice Quantity
0 1 1 2 0.99 1
In [85]:
query = """
            SELECT * FROM invoices
            LIMIT 5
            """
results = cursor.execute(query).fetchall()
pd.read_sql_query(query, sql_connect)
Out[85]:
InvoiceId CustomerId InvoiceDate BillingAddress BillingCity BillingState BillingCountry BillingPostalCode Total
0 1 2 2009-01-01 00:00:00 Theodor-Heuss-Straße 34 Stuttgart None Germany 70174 1.98
1 2 4 2009-01-02 00:00:00 Ullevålsveien 14 Oslo None Norway 0171 3.96
2 3 8 2009-01-03 00:00:00 Grétrystraat 63 Brussels None Belgium 1000 5.94
3 4 14 2009-01-06 00:00:00 8210 111 ST NW Edmonton AB Canada T6G 2C7 8.91
4 5 23 2009-01-11 00:00:00 69 Salem Street Boston MA USA 2113 13.86
In [296]:
query = """
            SELECT * FROM invoices
            LIMIT 5
            """
results = cursor.execute(query).fetchall()
pd.read_sql_query(query, sql_connect)
Out[296]:
InvoiceId CustomerId InvoiceDate BillingAddress BillingCity BillingState Country BillingPostalCode Total
0 1 2 2009-01-01 00:00:00 Theodor-Heuss-Straße 34 Stuttgart None Germany 70174 1.98
1 2 4 2009-01-02 00:00:00 Ullevålsveien 14 Oslo None Norway 0171 3.96
2 3 8 2009-01-03 00:00:00 Grétrystraat 63 Brussels None Belgium 1000 5.94
3 4 14 2009-01-06 00:00:00 8210 111 ST NW Edmonton AB Canada T6G 2C7 8.91
4 5 23 2009-01-11 00:00:00 69 Salem Street Boston MA USA 2113 13.86

4. Objectives

Query the Chinook database to answer the following questions:

  • What are the best selling genres?
  • What are the sales by country?
  • Which track and artist exist in most of the playlists?

4.1 Best selling genres

  • Sales is marked by InvoiceID
  • InvoiceID is common to invoices and invoice_items
  • CustomerId is common to customers and invoices
  • TrackId is common to invoice_items and tracks
  • GenreId is common to tracks and genres
In [217]:
q1 = """
WITH t1 AS
(SELECT * FROM invoice_items im
LEFT JOIN invoices iv ON iv.InvoiceId = im.InvoiceId
LEFT JOIN customers cs ON cs.CustomerId = im.InvoiceId)
SELECT gr.Name Genre,
SUM(t1.Quantity) Sales
FROM t1 
t1 LEFT JOIN tracks tr ON tr.TrackId = t1.TrackId
LEFT JOIN genres gr ON gr.GenreId = tr.GenreId

GROUP BY Genre
ORDER BY Sales DESC
Limit 5;
    """
results = cursor.execute(q1).fetchall()
qp = pd.read_sql_query(q1, sql_connect)

fig = px.bar(qp, 
             x = 'Genre',
             y = 'Sales',
             color = 'Genre',
             title = 'Top 5 best selling genres')

fig.show()
In [227]:
# rename table column name for joins 
q = """
ALTER TABLE invoices RENAME BillingCountry TO Country
"""
In [287]:
# Genre sales by country 
q2 = """ WITH t1 AS 
(SELECT * FROM invoice_items im
LEFT JOIN invoices iv ON iv.InvoiceId = im.InvoiceId
LEFT JOIN customers cs ON cs.CustomerId = iv.CustomerId)
SELECT gr.Name Genre, cs.Country Country,
SUM(t1.Quantity) Sales 
FROM t1
t1 LEFT JOIN tracks tr ON tr.TrackId = t1.TrackId
LEFT JOIN genres gr ON gr.GenreID = tr.GenreId
LEFT JOIN customers cs ON cs.Country = t1.Country

GROUP BY Genre
ORDER BY Sales DESC 
"""

results = cursor.execute(q2).fetchall()
qp = pd.read_sql_query(q2, sql_connect)

fig = px.bar(qp, 
             x = 'Country',
             y = 'Sales',
             color = 'Genre',
             title = 'Best selling genre in each country')

fig.show()
In [292]:
# Genre sales by country 
q3 = """ WITH t1 AS 
(SELECT * FROM invoice_items im
LEFT JOIN invoices iv ON iv.InvoiceId = im.InvoiceId
LEFT JOIN customers cs ON cs.CustomerId = iv.CustomerId)
SELECT gr.Name Genre, cs.Country Country,
SUM(t1.Quantity) Sales 
FROM t1
t1 LEFT JOIN tracks tr ON tr.TrackId = t1.TrackId
LEFT JOIN genres gr ON gr.GenreID = tr.GenreId
LEFT JOIN customers cs ON cs.Country = t1.Country

GROUP BY Genre
ORDER BY t1.Country ;
"""

results = cursor.execute(q3).fetchall()
qp = pd.read_sql_query(q3, sql_connect)

fig = px.bar(qp, 
             x = 'Genre',
             y = 'Sales',
             color = 'Country',
             title = 'Genre sale by country')

fig.show()

4.2 Sales performance by country

In [295]:
q4 = """
with t1 AS(
SELECT * FROM invoice_items im
LEFT JOIN invoices iv ON iv.InvoiceId = im.InvoiceId
LEFT JOIN customers cs ON cs.CustomerId = iv.CustomerId
)
SELECT cs.Country Country,
SUM(t1.Quantity) Sales
FROM t1 
LEFT JOIN customers cs ON cs.Country = t1.Country

GROUP BY t1.Country
ORDER BY Sales DESC;
"""

results = cursor.execute(q4).fetchall()
qp = pd.read_sql_query(q4, sql_connect)

fig = px.bar(qp, 
             x = 'Country',
             y = 'Sales',
             color = 'Country',
             title = 'Sale by country')

fig.show()
In [3]:
# average sales 
q5 = """
WITH t1 AS (
    SELECT CASE WHEN
    (
        SELECT COUNT(*) from customers
        WHERE cs.Country = Country
    ) = 1 THEN "Other"
    ELSE cs.Country END "Country", cs.CustomerId,
    iv.*
    FROM invoices iv
    INNER JOIN  customers cs on iv.CustomerId = cs.CustomerId  
)
SELECT Country, total_sales, total_customers, avg_sales, avg_order_value
FROM (
    SELECT Country,
    SUM(total) total_sales,
    COUNT(distinct CustomerId) total_customers,
    SUM(total) / COUNT(distinct CustomerId) avg_sales,
    SUM(total) / COUNT(distinct InvoiceId) avg_order_value
    
    FROM t1
    GROUP BY Country
    ORDER BY total_sales DESC
    )
     """

results = cursor.execute(q5).fetchall()
qp = pd.read_sql_query(q5, sql_connect)

fig = px.bar(qp, 
             x = 'Country',
             y = 'total_sales',
             color = 'Country',
             title = 'Total sales by country')

fig.show()

4.3 Artist popularity

  • Join using TrackId > AlbumId > ArtistId > GenreId > PlaylistId
  • Common columns
In [2]:
query = """
            SELECT * FROM playlist_track
        """
results = cursor.execute(query).fetchall()
x = pd.read_sql_query(query, sql_connect)
x.head()
Out[2]:
PlaylistId TrackId
0 1 3402
1 1 3389
2 1 3390
3 1 3391
4 1 3392
In [3]:
q6 = """
SELECT at.Name 'artist', 
COUNT(p.PlaylistId) 'count', 
p.Name 'playlist', 
gn.Name 'genre'
FROM playlist_track pl
    LEFT JOIN tracks tr ON tr.TrackId = tr.TrackId
    LEFT JOIN albums al ON al.AlbumId = tr.AlbumId
    LEFT JOIN artists at ON at.ArtistId = al.ArtistId
    LEFT JOIN genres gn ON tr.GenreId = gn.GenreId
    LEFT JOIN playlists p ON p.PlaylistId = pl.PlaylistId
    
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
        """
results = cursor.execute(q6).fetchall()
qp = pd.read_sql_query(q6, sql_connect)
In [4]:
qp
Out[4]:
artist count playlist genre
0 Iron Maiden 1856295 Music Rock
1 U2 1176525 Music Rock
2 Led Zeppelin 993510 Music Rock
3 Metallica 976080 Music Metal
4 Deep Purple 801780 Music Rock
5 Lost 801780 Music TV Shows
6 Pearl Jam 583905 Music Rock
7 Lenny Kravitz 496755 Music Rock
8 Various Artists 488040 Music Pop
9 The Office 461895 Music TV Shows
In [6]:
fig = px.bar(qp, 
             x = 'artist',
             y = 'count',
             color = 'genre',
             title = 'Artist popularity in playlists')

fig.show()